Background and Context

You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you understand how business is currently done and how it can be changed to benefit the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and to support or increase their sense of well-being.

This time, however, the company wants to harness the available data on existing and potential customers to make its marketing expenditure more efficient.

As a Data Scientist at "Visit with us", you have to analyze the customers' data to provide recommendations to the Policy Maker and the Marketing Team, and build a model to predict which potential customers will purchase the newly introduced travel package.

Problem Statement

To predict which customer is more likely to purchase the newly introduced travel package.

Data Description

Customer details:

CustomerID: Unique customer ID

Customer interaction data:

Index

Dataset Overview

EDA

Observation - 1:

Observation - 2:

Observation - 3:

Observation - 4:

Data Pre-Processing

Missing Value Treatment

Both outliers and missing values have now been treated, so we can proceed with the remaining steps.
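The notes do not say which treatment was applied. One common choice, shown here as a minimal sketch on made-up numbers, is to fill missing numeric values with the median and to cap outliers at the 1.5 x IQR whiskers. The function names and the sample durations are hypothetical, not taken from the notebook.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip every value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical pitch durations in minutes: one gap and one extreme value.
duration_of_pitch = [8, 9, 12, None, 14, 15, 16, 127]

filled = impute_median(duration_of_pitch)   # the gap becomes the median
capped = cap_outliers_iqr(filled)           # 127 is pulled down to the upper whisker
print(filled)
print(capped)
```

Capping keeps every row in the dataset, which matters when the positive class is small; dropping outlier rows would be an alternative with a different trade-off.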

Univariate Analysis

Duration of Pitch has outliers

Number of Trips also has outliers.

The majority of customers (around 59.7%) are male.

49.1% of customers visited with 3 people, followed by those visiting with 2.

About 42% of customers received at least 4 follow-ups, followed by 30% who received 3.

Basic and Deluxe were the products most often pitched by the team.

Majority of the customers are married.

30% of the customers have taken at least 2 trips followed by 22.1% of them taking at least 3 trips.

Only 29.1% of customers have a passport; the majority do not.

30% of customers gave a middling pitch satisfaction score of 3.

62% of the customers own a car.

42.6% of them were accompanied by 1 child.

Bivariate Analysis

Among those who have taken the product, monthly income is between ~20K and ~40K.

Customer Profile

Among customers pitched the Basic package, most of those who took the product (235) had been followed up 4 times.

Among customers pitched the Deluxe package, most of those who took the product (78) had been followed up 4 times.

Among customers pitched the King package, most of those who took the product (94) had been followed up 4 times.

In general, it appears to take 4 follow-ups with a customer in order to sell the product.

Among those who have taken the product, the monthly income appears to range between ~18K and ~35K.

Customer Profile

Number of Children, MonthlyIncome, ProductPitched, Designation, NumberOfChildrenVisited and NumberOfPersonVisited seem to be positively correlated.

A total of 920 customers out of 4888 have taken the product (~18.8%)

Split the dataset

We have 3421 observations in the training set and 1467 in the testing set.
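These sizes are consistent with a 70/30 split of the 4,888 rows. The notes do not show the call itself, so the sketch below is an assumption: it uses scikit-learn's `train_test_split` with `test_size=0.30`, stratifies on the target, and stands in a placeholder feature matrix for the real data. The `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total, n_positive = 4888, 920  # counts reported in the EDA

# Placeholder data: one dummy feature column and a target with 920 positives.
X = np.arange(n_total).reshape(-1, 1)
y = np.array([1] * n_positive + [0] * (n_total - n_positive))

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,   # assumed split ratio
    stratify=y,       # assumed: keep the class ratio equal in both sets
    random_state=1,
)

print(X_train.shape[0], X_test.shape[0])
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Stratifying is worth considering here because only about 18.8% of customers bought a package; an unstratified split can leave the test set with a noticeably different share of buyers.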

Building Models

Decision Tree

We will use recall as the evaluation metric: the ratio of true positives to actual positives. High recall implies few false negatives.

According to the decision tree model, Duration of Pitch, Age and Monthly Income are the most important features. The tree is complex, which suggests that it is overfitting the training data.

Training recall dropped from 1 to 0.17. Although the value is low, the tree is no longer overfitting and we have a more generalized model.

Earlier, Duration of Pitch was the top feature, but after limiting the depth to 3, Passport, Designation and Marital Status become the important features.
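Recall is TP / (TP + FN). The sketch below computes it by hand and illustrates why accuracy alone would mislead on this data: with the counts reported in the EDA (920 buyers out of 4,888 customers), a hypothetical baseline that predicts "no purchase" for everyone is right most of the time yet finds no buyers at all. The helper names are illustrative.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): the share of actual buyers the model finds."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# 920 buyers (1) and 3,968 non-buyers (0), as reported in the EDA.
y_true = [1] * 920 + [0] * (4888 - 920)

# Hypothetical baseline: predict "no purchase" for every customer.
always_no = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_no)
baseline_recall = recall(y_true, always_no)
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

For this problem a false negative is a likely buyer who never gets contacted, which is why recall is the metric tracked across models.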
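The effect of limiting the depth can be reproduced on synthetic data. This is a sketch under stated assumptions: `make_classification` generates an imbalanced stand-in for the customer table, and the sample size, feature counts and `random_state` are arbitrary. It is not the notebook's data, so the recall values will differ from those reported here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the customer data (roughly 19% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.81], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Fully grown tree versus a tree pre-pruned to depth 3.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    train_recall = recall_score(y_train, tree.predict(X_train))
    test_recall = recall_score(y_test, tree.predict(X_test))
    print(f"{name}: depth={tree.get_depth()}, "
          f"train recall={train_recall:.2f}, test recall={test_recall:.2f}")
```

A fully grown tree memorizes the training set, so its training recall is 1; the gap between training and test recall is the overfitting signal. The shallow tree narrows that gap at the cost of lower recall overall.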

We will use hyperparameter tuning to tune the decision tree model.

Recall has improved on both the training and test sets after hyperparameter tuning.
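One way to tune a tree for recall is a cross-validated grid search scored on recall. The sketch below assumes scikit-learn's `GridSearchCV`; the parameter grid is hypothetical (the notes do not list the values that were tried) and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.81], random_state=1,
)

# Hypothetical search grid; not the grid used in the original notebook.
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # pick the combination with the best cross-validated recall
    cv=5,
)
search.fit(X, y)

best_tree = search.best_estimator_  # refit on all of X, y with the best parameters
print(search.best_params_, round(search.best_score_, 3))
```

Including `class_weight="balanced"` in the grid is a design choice for imbalanced targets: it lets the search trade some precision for recall without resampling the data.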

Bagging Classifier

Accuracy on the training set increased, but on the test set it decreased. In my opinion, the model is overfitting the data.

The bagging classifier with logistic regression as base_estimator is not overfitting the data, but the test recall is very low. It is also difficult to interpret, because it has no feature importance attribute.
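A minimal sketch of bagging with a logistic regression base model, on synthetic data with arbitrary settings. The base model is passed positionally because the keyword was renamed from `base_estimator` to `estimator` in scikit-learn 1.2, so the positional form runs on both older and newer releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the customer data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.81], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# First positional argument is the base model (base_estimator / estimator).
bagging_lr = BaggingClassifier(
    LogisticRegression(max_iter=1000), n_estimators=10, random_state=1
)
bagging_lr.fit(X_train, y_train)

train_recall = recall_score(y_train, bagging_lr.predict(X_train))
test_recall = recall_score(y_test, bagging_lr.predict(X_test))

# BaggingClassifier exposes no feature_importances_ attribute.
has_importances = hasattr(bagging_lr, "feature_importances_")
print(f"train recall={train_recall:.2f}, test recall={test_recall:.2f}, "
      f"feature_importances_ available: {has_importances}")
```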

Random Forest Classifier

With default parameters:

Both models, the bagging classifier and the random forest classifier, are overfitting the training data. They give similar accuracy, but the bagging classifier gives better recall.

Model Evaluation

Adaboost, XGBoost, Gradient Boost and Stacking Classifier

Monthly Income, Age, Duration of Pitch and Number of Trips are important.

Test accuracy and test recall have increased slightly. As we are getting better results, we will use init = AdaBoostClassifier() to tune the gradient boosting model.
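The `init` parameter of scikit-learn's `GradientBoostingClassifier` sets the estimator that produces the initial predictions, which the boosting stages then correct. A minimal sketch with `init=AdaBoostClassifier()`, again on synthetic data with arbitrary settings rather than the notebook's own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the customer data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.81], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Gradient boosting that starts from AdaBoost's predicted probabilities
# instead of the default constant prior.
gb_init = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)
gb_init.fit(X_train, y_train)

pred = gb_init.predict(X_test)
test_accuracy = accuracy_score(y_test, pred)
test_recall = recall_score(y_test, pred)
print(f"test accuracy={test_accuracy:.2f}, test recall={test_recall:.2f}")
```

The initial estimator must implement `fit` and `predict_proba`, which `AdaBoostClassifier` does.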

models = [bagging_estimator, bagging_estimator_tuned, bagging_lr, rf_estimator, rf_estimator_tuned, rf_estimator_weighted]

Comparing all models

Insights

Insights & Recommendations